'Humanity's Last Exam' benchmark is stumping top AI models – can you do any better?
Are artificial intelligence (AI) models really surpassing human ability? Or are current tests just too easy for them?
On Thursday, Scale AI and the Center for AI Safety (CAIS) released Humanity’s Last Exam (HLE), a new academic benchmark aiming to “test the limits of AI knowledge at the frontiers of human expertise,” Scale AI said in a release. The test consists of 3,000 text and multi-modal questions on more than 100 subjects like math, science, and humanities, submitted by experts in a variety of fields.
Also: Roll over, Darwin: How Google DeepMind’s ‘mind evolution’ could enhance AI thinking
Anthropic’s Michael Gerstenhaber, head of API technologies, noted to Bloomberg last fall that AI models frequently outpace benchmarks (part of why the Chatbot Arena leaderboard changes so rapidly when new models are released). For example, many LLMs now score over 90% on Massive Multitask Language Understanding (MMLU), a commonly used benchmark. This is known as benchmark saturation.
By contrast, Scale reported that current models answered fewer than 10% of the HLE benchmark’s questions correctly.
Researchers from the two organizations collected over 70,000 questions for HLE initially, narrowing them to 13,000 that were reviewed by human experts and then distilled once more into the final 3,000. They tested the questions on top models like OpenAI’s o1 and GPT-4o, Anthropic’s Claude 3.5 Sonnet, and Google’s Gemini 1.5 Pro alongside the MMLU, MATH, and GPQA benchmarks.
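For a rough sense of how a closed-ended benchmark like this is typically scored, here is a minimal sketch of an exact-match evaluation loop. The question schema, the `model_fn` stand-in, and the toy examples are all hypothetical; HLE’s actual grading harness, prompts, and multimodal handling may differ.

```python
# Minimal sketch of exact-match scoring for a closed-ended benchmark.
# The item schema and model_fn interface are illustrative assumptions,
# not HLE's real evaluation harness.
from typing import Callable

def exact_match(prediction: str, answer: str) -> bool:
    """Case- and whitespace-insensitive comparison of short answers."""
    return prediction.strip().lower() == answer.strip().lower()

def evaluate(questions: list[dict], model_fn: Callable[[str], str]) -> float:
    """Accuracy of model_fn over items shaped like {'question': ..., 'answer': ...}."""
    correct = sum(
        exact_match(model_fn(item["question"]), item["answer"])
        for item in questions
    )
    return correct / len(questions)

# Toy usage with a stand-in "model" that always answers "4".
sample = [
    {"question": "What is 2 + 2?", "answer": "4"},
    {"question": "What is the capital of France?", "answer": "Paris"},
]
print(f"accuracy: {evaluate(sample, lambda q: '4'):.0%}")  # prints 50%
```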
“When I released the MATH benchmark — a challenging competition mathematics dataset — in 2021, the best model scored less than 10%; few predicted that scores higher than 90% would be achieved just three years later,” said Dan Hendrycks, CAIS co-founder and executive director. “Right now, Humanity’s Last Exam shows that there are still some expert closed-ended questions that models are not able to answer. We will see how long that lasts.”
Also: DeepSeek’s new open-source AI model can outperform o1 for a fraction of the cost
Scale and CAIS gave contributors cash prizes for the top questions: $5,000 went to each of the top 50, while the next best 500 received $500. Although the final questions are now public, the two organizations kept another set of questions private to guard against “model overfitting,” which occurs when a model is trained so closely to a dataset that it cannot make accurate predictions on new data.
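To illustrate why a withheld question set helps catch overfitting, the sketch below compares a model’s accuracy on the published questions with its accuracy on privately held ones. The scores and the gap threshold are invented for the example, not figures reported by Scale or CAIS.

```python
# Illustrative overfitting check: a model tuned (directly or indirectly)
# on published benchmark questions should score noticeably better on them
# than on a privately held set. All numbers here are made up.

def overfitting_gap(public_accuracy: float, private_accuracy: float) -> float:
    """How much better the model scores on questions it may have been tuned on."""
    return public_accuracy - private_accuracy

public_acc = 0.09   # hypothetical score on the released questions
private_acc = 0.05  # hypothetical score on the withheld questions

gap = overfitting_gap(public_acc, private_acc)
if gap > 0.03:  # arbitrary threshold for this example
    print(f"Gap of {gap:.0%}: public-set results may reflect overfitting.")
else:
    print(f"Gap of {gap:.0%}: public and private scores are roughly consistent.")
```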
The benchmark’s creators note that they are still accepting test questions, but will no longer award cash prizes, though contributors are eligible for co-authorship.
CAIS and Scale AI plan to release the dataset to researchers so that they can further study new AI systems and their limitations. You can view all benchmark and sample questions at lastexam.ai.